Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add mimalloc #9

Open
wants to merge 1 commit into
base: main
Choose a base branch
from
Open

add mimalloc #9

wants to merge 1 commit into from

Conversation

rob-p
Copy link

@rob-p rob-p commented May 29, 2024

I created this branch to play around with potential optimizations. However, right now this PR just contains one (very trivial) optimization. It replaces the global allocator with Mimalloc. This is a highly-optimized allocator that generally outperforms the system malloc. Given the propensity of Granges to perform many small allocations, such a change seems to help. I'd be very interested in your measurements!

@jianshu93
Copy link

Hi Rob,

I saw 5% to 10% improvement at least using my random test dataset. I think it is worth it. Thanks for bring the mimalloc into attention. I have some problem with jemalloc but it seems mimalloc is perfectly compatible with system default malloc while being faster.

Thanks,

Jianshu

@molpopgen
Copy link
Collaborator

I would suggest that binaries including granges should define their global allocators in the file containing main. Unless things have changed, you can run into a mess when library crates define global allocators. For example, if a Vec<_> from another crate were passed to granges, they may not be compatible due to different global allocatorrs.

@jianshu93
Copy link

Yes exactly. The best way is to add a feature like use_mimalloc. So that users can choose whehter they want to use it or not. I have tried in other crates with very frequent heap allocation. Let me know if this makes sense.

Thanks,

Jianshu

@molpopgen
Copy link
Collaborator

There is a matter of taste/style here. The other allocator may improve the performance of the current binary. If that appeared generally true (e..g, not just on macos or whatever), then maybe it should not be feature gated for the binary. The reason is that, imagine allocator A is best on some hardware, and B is best on another. Once you feature-gate in mimalloc ("A"), you cannot add a feature in for B, because you break the expectation that features are strictly additive.

@rob-p
Copy link
Author

rob-p commented May 31, 2024

These are good points. I think the claim is that it makes sense to add the global allocator to the granges binary, and then suggest (e.g. in the README and documentation) that consumers of the library might consider adopting this allocator for best performance.

One thing that I wanted to mention to @jianshu93 is that, while mimalloc provides a nice bump in the current testing, the gains are often even larger in the multi-threaded context — it's improvement in allocation under thread contention is even greater.

@molpopgen: I don't think the improvements of mimalloc over the standard allocator are particularly hardware dependent. Their benchmarks (and my experience) suggest its generally faster at the same basic memory usage.

@jianshu93
Copy link

Hello both,

I actually test it via several different machines in a parallel setting. Intel(R) Xeon(R) Gold 6226, AMD EPYC 7713, AMD EPYC 9534 . The performance bump is more pronounced on the most recent AMD EPYC 9534 for several crate, e.g, graph (skip list, well-known for current/parallel) as the underlying structure and also this crate. I never saw a 20% improvement or something, most around 8% to 15%. Should be interesting to test more different data structure because the mimalloc paper reported only for several give structures.

Thanks,

Jianshu

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants